Applications in Natural Language Processing
FIGURE 5.2
Fully quantized transformer.
estimates computed during training. For every forward pass, xmin and xmax variables are
updated via an exponential moving average with a momentum of 0.9.
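This running-range update can be sketched as follows; the class and variable names here (`RangeTracker`, `momentum`) are illustrative, not from the original, but the update rule matches the text: an exponential moving average with momentum 0.9, applied on every forward pass.

```python
# Sketch of EMA-based range tracking for quantization, assuming a
# momentum of 0.9 as stated in the text. Names are illustrative.

class RangeTracker:
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.xmin = None
        self.xmax = None

    def update(self, x):
        """Update the xmin/xmax estimates from the current batch `x`."""
        batch_min, batch_max = min(x), max(x)
        if self.xmin is None:
            # First forward pass: initialize directly from the batch.
            self.xmin, self.xmax = batch_min, batch_max
        else:
            # EMA: keep 90% of the old estimate, blend in 10% of the batch.
            self.xmin = self.momentum * self.xmin + (1 - self.momentum) * batch_min
            self.xmax = self.momentum * self.xmax + (1 - self.momentum) * batch_max
```

For example, after observing a batch with range [0, 1] and then a batch with range [-1, 2], the tracked range is [-0.1, 1.1] rather than the full [-1, 2], which damps the effect of outlier batches.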
During backpropagation, the straight-through estimator [37] is used to bypass the undifferentiable round function, and the gradients of clamped values are set to zero.
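The two gradient rules can be illustrated with a minimal hand-written forward/backward pair; this is a standalone sketch (function names are mine, and a real model would implement the same idea inside an autograd framework), showing that the round is invisible to the gradient while clamping zeroes it.

```python
# Minimal sketch of the straight-through estimator (STE) for clamp-then-round
# quantization. Forward: clamp to [xmin, xmax], then round. Backward: pass the
# incoming gradient through the round unchanged, but zero it where the input
# was clamped. Names are illustrative, not from the original.

def ste_quantize_forward(x, xmin, xmax):
    """Clamp then round each value; also return a mask of non-clamped entries."""
    clamped = [min(max(v, xmin), xmax) for v in x]
    mask = [xmin <= v <= xmax for v in x]  # True where no clamping occurred
    return [round(v) for v in clamped], mask

def ste_quantize_backward(grad_out, mask):
    """STE: the round behaves as the identity for gradients; clamped values get zero."""
    return [g if m else 0.0 for g, m in zip(grad_out, mask)]
```

For instance, with inputs [0.4, 3.7, -2.0] and range [0, 2], the forward pass yields [0, 2, 0]; in the backward pass only the first entry, which was not clamped, receives a nonzero gradient.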
5.2.2
What to Quantize
They choose to quantize all operations, which can provide a computational speed gain at
inference. The overview is presented in Fig. 5.2. In particular, they quantize all matrix mul-
tiplications, meaning that both the inputs and the weights of each MatMul are b-bit quantized.
The model’s divisions are also quantized as long as the numerator and denominator are